Phase 1.3.D + 1.3.E: text + explicit strategies + CLI (v0.3.0) #4
Implements the two remaining pdfplumber table-finding strategies:
- text: infer column boundaries by clustering words on X0 / X1 /
centre; infer row boundaries by clustering on top-Y. Direct port of
pdfplumber's words_to_edges_v / words_to_edges_h with the same
MinWordsVertical (3) / MinWordsHorizontal (1) defaults.
- explicit: caller-supplied edges via TableSettings.Explicit*Lines.
At least two coordinates required per axis (matches pdfplumber's
validation); non-finite values dropped with a log warning.
Each axis selects its strategy independently, so mixed-strategy
settings (e.g. vertical=text + horizontal=lines) work out of the box.
- New layout.SourceText enum tagging text-derived edges.
- Page.findTableEdges refactored to dispatch per-axis on strategy
instead of starting from a single primitive-edge slice.
- ensureSupportedStrategies now only rejects unknown strategy strings.
- New table_test.go cases: unit tests on hand-crafted Words slices;
borderless / explicit / mixed extraction end-to-end on the new
testdata.TableBorderless() fixture.
- pdfplumber parity test for the borderless fixture
(TestGoldenTablesTextStrategyAgainstPdfplumber) — matches
cell-for-cell against pdfplumber's find_tables({text, text}).
- scripts/capture_pdfplumber_text_golden.py captures the
text-strategy expectation for any fixture with a sibling
.tables-text.target marker.
Adds cmd/pdftable, a stdlib-only command-line interface mirroring pdfplumber's CLI surface for the operations the library implements:
- extract <file.pdf> [flags]: tables (--tables) or text (--text) on one page, a range (--pages 1,3-5), or all pages.
- Output format selectable via --format json|text. JSON shape includes page dimensions, table bbox, per-cell bbox, and rows.
- Full TableSettings surface exposed as flags: --vertical-strategy / --horizontal-strategy, --snap-tolerance, --join-tolerance, --edge-min-length, --intersection-tolerance, --text-tolerance, --min-words-vertical/horizontal, --explicit-vertical-lines/horizontal-lines, --indent.
- Positional argument can appear before OR after flags (pdfplumber-style invocation); reorderFlagsLast() shuffles tokens so the standard library flag package can parse either ordering.
Tested via cmd/pdftable/main_test.go: end-to-end runs against the issue-466-example and table-3x4-borderless fixtures, plus unit tests on parsePages, reorderFlagsLast, and the error paths. No new go.mod dependencies: uses standard library flag, encoding/json, strings, strconv only.
- CHANGELOG.md: v0.3.0 entry covering text + explicit strategies, mixed-strategy support, the pdftable CLI, the layout.SourceText enum, and the borderless parity fixture. Known limitations note the carried-over font-metric drift on cell text.
- README.md: status bumped to v0.3.0; "Tables" section reworked with side-by-side pdfplumber → pdftable snippets for all four strategies plus a mixed-strategy example; new "CLI" section documenting the extract subcommand and full flag table; roadmap reflects v0.4.x as the AFM-bundle phase.
Summary
Completes pdfplumber parity for the four canonical table-finding strategies. Ships:
- text strategy: column boundaries inferred from clusters of words sharing X0/X1/centre; row boundaries from clusters sharing top-Y. Direct port of pdfplumber's words_to_edges_v / words_to_edges_h. Tunable via MinWordsVertical (default 3) / MinWordsHorizontal (default 1).
- explicit strategy: caller-supplied edges via TableSettings.ExplicitVerticalLines / ExplicitHorizontalLines. At least two coordinates required per axis (matches pdfplumber); non-finite values dropped with a log warning.
- pdftable CLI: cmd/pdftable/main.go with extract <file.pdf> [flags] mirroring pdfplumber's CLI surface. Stdlib flag only; no new go.mod dependencies.
What's in
- finder_text.go: wordsToEdgesV, wordsToEdgesH, explicitVerticalEdges, explicitHorizontalEdges, validateExplicitForStrategy.
- internal/layout/lines.go: new SourceText enum value.
- finder.go: ensureSupportedStrategies reworked to only reject unknown strings (all four strategies now valid).
- page.go: findTableEdges refactored to per-axis strategy dispatch via a new baseEdges helper. FindTables / ExtractTables invoke validateExplicitForStrategy.
- table.go: updated docs on TableStrategy constants, MinWordsVertical/Horizontal, and ExplicitVerticalLines/HorizontalLines.
- cmd/pdftable/main.go: CLI implementation.
- testdata/fixtures.go: new TableBorderless() helper (3-column borderless table) used by unit + parity tests.
- scripts/capture_pdfplumber_text_golden.py: captures pdfplumber's find_tables({text, text}) output for fixtures with a sibling .tables-text.target marker.
What's out
- page.crop()): used by some pdfplumber text-strategy tests (e.g. nics-background-checks-2015-11); out of scope here.
Test plan
- go build ./... clean.
- go vet ./... clean.
- go test -count=1 ./...: 98 passing, 0 failing across all packages.
- wordsToEdgesV / wordsToEdgesH on hand-crafted Word slices (alignment, threshold, empty).
- explicitVerticalEdges / explicitHorizontalEdges (NaN/Inf filtering, source tagging).
- validateExplicitForStrategy (≥2 coords required when strategy is explicit).
- ExtractTables tests against TableBorderless() for text-only, explicit-only, and mixed strategies.
- table-3x4-borderless.pdf matches cell-for-cell against find_tables({text, text}): 1 table, 7 rows × 3 cols (header + 3 alternating data/empty rows).
- --pages filtering, missing-file error, mutually-exclusive --tables --text, parsePages parser correctness, reorderFlagsLast flag-order normalisation.
- go build ./cmd/pdftable && ./pdftable extract testdata/golden/issue-466-example.pdf --tables --format json produces valid JSON with 2 detected tables.
Parity fixtures matching
- table-3x4-borderless (text strategy, both axes): 1 table × 7 rows × 3 cols, cell-for-cell match.
- issue-466-example, hello, rules, simple1) continue to pass.
Roughly added
~2,150 lines (1,146 + 737 + ~270) across implementation, CLI, tests, scripts, and docs.
Do not merge
Awaiting review.